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Abstract 

This paper presents a novel communication-efficient parallel belief propagation 
(CE-PBP) algorithm for training latent Dirichlet allocation (LDA). Based on the 
synchronous belief propagation (BP) algorithm, we first develop a parallel belief 
t> . propagation (PBP) algorithm on the parallel architecture. Because the extensive 

' communication delay often causes a low efficiency of parallel topic modeling, we 

ON 1 further use Zipf 's law to reduce the total communication cost in PBP. Extensive 

experiments on different data sets demonstrate that CE-PBP achieves a higher 
topic modeling accuracy and reduces more than 80% communication cost than 
^sO , the state-of-the-art parallel Gibbs sampling (PGS) algorithm. 
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1 Introduction 



'k^ • Topic modeling for massive data sets has attracted intensive research interests recently, because 

large-scale data sets such as collections of images and documents are becoming increasingly com- 
mon lIU 12 [3J ID . Online and parallel topic modeling algorithms have been two major strategies 
for massive data sets. The former processes the massive data stream by mini-batches, and discards 
the processed mini -batch after one look (2]|4l. The latter uses the parallel architecture to speed up 
the topic modeling by multi-core/processor and more memory resources |fl] [3]. Although online 
topic modeling algorithms use less computational resources, their topic modeling accuracy depends 
on several heuristic parameters including the mini-batch size ||2]|4), and is often comparable or less 
than batch learning algorithms. In practice, online algorithms are often 2^5 times faster than batch 
algorithms [4|, while parallel algorithms can get 700 times faster under 1024 processors [ 1 1. Because 
the parallel architecture becomes cheaper and widely-used, the parallel topic modeling algorithms 
are becoming an ideal choice to speed up topic modeling. However, parallel topic modeling is not 
a trivial task, because its efficiency depends highly on extensive communication/synchrononization 
delays across distributed processors. Indeed, the communication cost determines the scalability of 
the parallel topic modeling algorithms. 

In this paper, we propose a novel communication-efficient parallel belief propagation (CE-PBP) al- 
gorithm for training latent Dirichlet allocation (LDA) [5 1, one of the simplest topic models. First, 
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we extend the synchronous BP algorithm [6| to PBP on the parallel architecture for training LDA. 
We show that PBP can yield exactly the same results as the synchronous BP. Second, to reduce 
extensive communication/synchrononization delays, we use Zipf's law [7| to determine the commu- 
nication rate of synchronizing global parameters in PBP. Using the different communication rates, 
we show that the total communication cost can be significantly reduced. Extensive experiments 
confirm that CE-PBP reduces around 85% communication time, and achieves a much higher topic 
modeling accuracy than the state-of-the-art parallel Gibbs sampling algorithm (PGS) fl]|8]. 



2 Parallel Belief Propagation (PBP) 

LDA allocates a set of semantic topic labels, z = {z^ d }, to explain non-zero elements in the 
document-word co-occurrence matrix xwxd = {%w,d}< where 1 < w < W denotes the word 
index in the vocabulary, 1 < d < D denotes the document index in the corpus, and 1 < k < K 
denotes the topic index. Usually, the number of topics K is provided by users. The topic label 
satisfies 2^ d — {0, 1}, Ylk=x z w d = After inferring the topic labeling configuration over the 
document-word matrix, LDA estimates two matrices of multinomial parameters: topic distributions 
over the fixed vocabulary 4>wxK — {^-,k}> where 9. t d is a isT-tuple vector and 4>-,k is a VK-tuple 
vector, satisfying J^k ®k,d = 1 and J2 W <f>w,k — 1. From a document-specific proportion 8.d, LDA 
generates a topic label z k d = 1, which further combines <j>.^ to generate a word index w, forming the 
total number of observed word counts x w .d- Both multinomial vectors 6. t d and 4>-,k are generated by 
two Dirichlet distributions with hyperparameters a and j3. For simplicity, we consider the smoothed 
LDA with fixed symmetric hyperparameters provided by users [|9j|. 

After integrating out the multinomial parameters {cf>, 8}, LDA becomes the collapsed LDA in the 
collapsed hidden variable space {z, a, (3}. The collapsed Gibb sampling (GS) J9J is a Markov 
Chain Monte Carlo (MCMC) sampling technique to infer the marginal distribution or message, 
Hw,d.i{k) — p(z^ j j = 1), where 1 < i < x W: d is the word token index. The message update 
equation is 

z k ■ + & 



where z k d _ 4 = Yl w z w,d,-i< z t,-,-i = 12d z t.d,-i> an ^ tne notation —i denotes excluding the 
current topic label d i . Then, GS randomly samples a topic label d i = 1 from the message, 
and immediately estimates messages of other word tokens. 

Unlike GS, BP [6] infers messages, fi Wt d(k) = p(z^ d — 1)> without sampling in order to keep all 
uncertainties of messages. The message update equation is 

HwA k ) \M-w.d ( k ) +a\x =—r — , (2) 

h, w \Pw,-dW +P\ 

where = Y!,-w %-w,dfJ>-w,d(k) and fi w _ d (k) = J2-d x ™~dHw-d(k). The notation 

— w and —a denote all word indices except w and all document indices except d. Eq. (f2]l differs 
from Eq. (fl~|i in two aspects. First, BP infers messages based on word indices rather than word 
tokens. Second, BP updates and passes complete messages without sampling. In this sense, BP can 
be viewed as a soft version of GS. Obviously, such differences give Eq. (fJJ two advantages over 
Eq. (|T). First, it keeps all uncertainties of messages for higher topic modeling accuracy. Second, 
it scans the number of non-zero elements (NNZ) for message passing, which is significantly less 
than the total number of word tokens d x -w,d in x. So, BP is often faster than GS by scanning a 
significantly less number of elements (NNZ J2 W d x w,d) at each training iteration [151 . 

Based on the parallel architecture, we propose the parallel belief propagation (PBP) algorithm to 
speed up the synchronous BP. First, we define two matrices, 

Ok,d = J~]xiv,dfhu,d(k), (3) 

W 
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so that we can re-write iff) as 



Vw,dW (x [ 9 kd+ a \ x — > (5) 

where — w and — d denote excluding x w .dHw,d(k) from the matrices (01 and (0. At each training 
iteration t, 1 < t < T, the synchronous BP updates the message (0 using (0 and © at t — 1 iteration 
for non-zero elements in x. The updated messages are then used to estimate two matrices 01 and (0 
at t iteration. After T iterations, the synchronous BP stops and normalizes ^ fe Qk,d — 1 an d 
Sui 4>iu k = 1 t0 obtain the multinomial parameters O^d an d TO 

PBP distributes D documents into 1 < m < M processors. Thus, the matrix Okxd can be also 
distributed into M processors as 9k,d,m, but the matrix 4>wxK is shared by M processors. At each 
training iteration t, each local processor m sweeps the local data x Wl d,m using Eqs. (0 to Q. The 
updated local 9k,d,m is independent, but the updated local <p w k m should influence each other across 
M processors. So, we need to communicate each local <fi w k m in order to synchronize the global 
matrix 4> w k , 

M 

4>w,k <~ 4>w,k + (^w,k,m ~ <l>w,k)i ( 6 ) 
m— 1 

After synchronization, we copy the global <p w h to local <p w k m for the next training iteration, 

<- 4>w,h- C7) 

Because PBP follows the synchronous schedule, it produces exactly the same results of the syn- 
chronous BP [6|. Notice that the parallel Gibbs sampling algorithm (PGS) |fl] ID is an approximate 
solution to GS in ((TJ, because GS uses an asynchronous schedule for message passing |6]. 

According to Eqs. (O and ©, PBP needs to communicate and synchronize a total of 2 x M x 4>ky.w 
matrices at each training iteration t. Let us take WIKI data set in TableQ]as an example. If K = 10 
and M = 32, we need to communicate 400MBytes in the parallel architecture at each training 
iteration. This communication cost is so high as to delay synchronization. For a total of T training 
iterations, the total communication cost is calculated as 

Total communication cost = 2 X M X T X 4>kxw- (8) 

Notice that the communication cost of PGS IUE1 can be also calculated as (H), but with the following 
major difference. In a common 32-bit desktop computer, PBP uses the double type (8 byte) but PGS 
uses the integer type (4 byte) to store the matrix 4>yy x x ■ So, PGS requires only half communication 
cost as PBP, i.e., around 200MBytes in the above WIKI example. Because this communication cost 
is still a bottleneck, to reduce dHJ, PGS changes the communication rate by running © and (|7) at 
every T' > 1 training iterations [ 1 1, so that the total communication cost can be reduced to a fraction 
1/T" of ([§}. However, the low communication rate slows down the convergence and degrades the 
overall topic modeling performance of PGS (T). As a result, PGS suggests running © and (|7]i at 
every training iteration, which causes a serious communication delay. In CE-PBP, we aim to reduce 
the total communication cost ([8]) using Zipf 's law without degrading the overall topic modeling 
performance very much. 



3 Reduce Communication Costs by Zipf 's Law 

Zipf's law [jl0] reveals that the word frequency rank r has the following relationship with the word 
frequency / in many natural language data sets , 

logr = C-#log/, (9) 

where C and H are positive constants. Zipf's law indicates that the logarithm of the word rank in 
the frequency table is inversely proportional to its logarithm of frequency. Generally, the frequency 



3 




Figure 1 : Zipf 's law regulates different communication rates of different parts. The top panel shows 
four Zipf's curves for the four data sets in Table [1] The bottom panel shows the uniform partition 
of the matrix 4>wxK mto 16 parts according to Zipf's curves. In case of T — 16 training iterations, 
the part with rank 1 communicates 16 times, while the part with rank 16 communicates only once 
when H = 1. 



Algorithm 1: The CE-PBP Algorithm. 



Input: x, T, K, M, N, a, (3 
Output: <j> WxK ,9 KxD 

1 Distribute x w ^, m to M processors; 

2 Random initialization: global cj) w k and local 6 k d m ; 

3 Copy global cj> wk to local processor: cj) wkm <- (j> w k ; 

4 for t <- 1 to T do 

for each processor m in parallel do 

for each part with rank r in communication by Zipf's law do 
I -t -t-i 

Update local sub-matrices: <pw k ^— <pi 
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is end 



end 

for d 



1 to D, w <r- 1 to W, k 



„ > K ' 

1 to K, x w ^ m ^ do 

— d 



.+W0 



Vl.d,m( k ) K [ e k.d,m +0\X d 

end 

4 > w,k,m ~ '^2d Xw , d , m f J 'w,d,m( k: ) > 

end 

// Zipf's law based communication and synchronization 
for each part with rank r in communication by Zipf's law do 



Update global sub-matrices: (f>w K «— <\>w_^ K + J2 m =i(^ )w Km ~ ( t >w k)>'< 



end 



of a word determines its contribution to message passing as shown in ([3]l and (0]). So, the higher 
word rank corresponds to the more contribution to topic modeling. 
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To reduce the total communication cost ©, we aim to reduce the size of matrix 4> WxK at each 

training iteration. First, we uniformly partition the matrix 4>y/ X K m terms of W into N parts, 
where the first part with rank r = 1 contains the most frequent words in the vocabulary and so on. 
If we sort the word frequency for different parts in a descending order, we can plot the approximate 
Zipf's curves for the four data sets in Table [1] according to (|9j as shown in Fig.Q] As a result, we 
obtain N sub-matrices 4>w xK , satisfying W/N <C W. Here, we use different communication rates 
for different parts, 

Communication rate = r H , (10) 

where r is the part rank in terms of word frequency and H is the slope of Zipf curve in Eq. ((TJ. 
When the slope H is large, the word frequency is small in part with large rank r, which has a low 
communication rate to save time. When H = 1, the part with rank r communicates \T/r\ times 
in T training iterations, where |_-J is the floor operation. As a result, the part with rank r starts 
communicating if and only if the current iteration t is multiples of r. For example, when T = 100, 
the part with rank 16 will communicate when the current iteration t E {16,32,48,64,80,96}. Based 
on Eq. ([Tol l, the total communication cost ^ becomes 

J2 N _ \T/r H \ 

Reduced communication cost = 2 x M x — r ~ 1 ^ r x 4>wxKi (H) 

where |_-J is the floor operation. Because we use different communication rates [T/r H \ for different 

parts with rank r, Eq. (fTTb is significantly smaller than dHJ, i.e., ~ =1 N — - <C T. Let us take WIKI 
data set in Table [TJ as an example. If K = 10 and M = 32, we need to communicate/synchronize 
200GBytes in the parallel architecture for T = 500 iterations according to (|8). In our strategy, if 
N = 100, we require only around lOGBytes for communication/synchronization according to (fTTb . 
which takes only 5% communication cost in ([8]). 

The Zipf's law based communication rates are reasonable because the global matrix 4> w x K is dom- 
inated by the high frequent words in (@). So, the high communicate rate for the sub-matrix with top 
word frequencies ensures the accuracy of the global 4>y/xK- Compared with the uniform commu- 
nication rate for the entire matrix <fiw X K m PGS, the different communication rates for different 
sub-matrices <fiw xK are more effective. The experiments on several data sets confirm that the pro- 
posed communication method degrades only 1% topic modeling accuracy measured by the training 
perplexity, but gains much higher parallel topic modeling efficiency. 

The communication-efficient parallel belief propagation (CE-PBP) algorithm is summarized in Al- 
gorithm!??] From Line 1 to 3, we distribute the document-word matrix xwxd into M processors, 
and randomly initialize the global and local parameter matrices. During each training iteration t, 
we perform the parallel message passing independently in M processors. At the end of each train- 
ing iteration, we communicate and synchronize the global parameter matrix according to Zipf's law 
based communication rates. Therefore, CE-PBP reduces the total communication cost by different 
communication rates for different sub-matrices <fiw xK . 



4 Experiments 

We use four data setffi KOS, NIPS, ENRON and WIKI in TableQl where D denotes the number of 
documents, W denotes the size of vocabulary, NNZ denotes the number of non-zero elements in 
the document-word matrix. Since KOS is a relatively smaller data set, we use it for parameter tuning 
in CE-PBP. The number of topics, K = 100, is fixed in all experiments except for special statements. 
The number of training iterations T = 500. We use the same hyperparameters a = ft = 0.01 for 
CE-PBP and PGS. Due to limited computational resources, we use only 32 processors to compare 
the performance of CE-PBP and PGS. We find that the communication cost follows Eqs. ^ and ( fTTt , 
so that our results can be safely generalized to more processors in the parallel architecture. 
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Table 1: Statistics of four document data sets. 
Date Sets D W NNZ 



KOS 3430 6906 353160 

NIPS 1740 13649 933295 

ENRON 39861 28102 3710420 

WIKI 20758 83470 9272290 




H N Number of topics 

Figure 2: Left (a): Predictive perplexity as a function of H . Middle (b): Predictive perplexity as a 
function of N. Right (c): Communication time of PGS, PBP and CE-PBP (seconds). 



4.1 Parameters for CE-PBP 

The parameter H is the slope of the Zipf 's curve in Fig. [Ta] which determines the communication 
rate for part r. Although Zipf 's law applies to many natural language data sets, some data sets do not 
fit Zipfian distribution perfectly, which can be easily validated by H . For example, the parameter 
H for KOS data set varies from 1 to 1.6. Consequently, we want to know if different H will 
influence the performance of CE-PBP. Fixing N = 16, we change H from 1 to 2 with the step 0.1 to 
investigate both training time and predictive perplexity, where the predictive perplexity on test data 
set is calculated as in lfTTl l6l. Usually, the lower predictive perplexity often corresponds to the higher 
topic modeling accuracy. Fig. [2^ shows how training time and predictive perplexity change with the 
parameters H on the KOS data set, respectively. Notice that we subtract 1150 from the value of 
predictive perplexity to fit in the same figure. Fig. [2^ shows that when H increases from 1 to 2, 
the training time decreases slowly, while predictive perplexity increases. When H = 1 in Eq. (|T), 
CE-PBP achieves the highest accuracy with a slightly more training time. So, we empirically set 
H = 1 in the rest of experiments. 

On the other hand, the number of parts N for the global parameter matrix <fi influences the topic 
modeling performance of CE-PBP. The larger N leads to the more communication cost reduction 
according to (fTTl i. However, the larger N implies that in the fixed training iteration T more sub- 
matrices communicate less frequently, degrading the overall topic modeling performance. Fixing 
H = 1, we change N from 1 to 32 with the step size 8 in the experiments. Fig. [2J3 shows that 
the effect of parameter N. While communication cost decreases with N according to (fTTT l. the 
predictive perplexity increases steadily with N because the part with higher rank r communicates 
less frequently. We empirically set N = 16 to achieve a relatively balanced performance. In this 
case, the communication cost of CE-PBP is around 20% of PBP according to ([8]) and ( fTTT ). 

Under the parameters H = 1 and N = 16, we compare the predictive perplexity between CE-PBP 
and PGS in Fig. EJ:. When the number of topics K 6 {10, 20, 30, 40, 50}, CE-PBP consistently 
achieves much lower predictive perplexity values than PGS . Such results are consistent with previous 
results on comparing BP and GS (5). As a summary, CE-PBP has a higher topic modeling accuracy 
than PGS in terms of perplexity. 



'http://archive.ics.uci.edu/ml/machine-leamingdatabases 
"http://nlp.uned.es/social-tagging/wiki 10+ 
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Figure 3: Communication time of PGS, PBP and CE-PBP (seconds). 



2600 
2400 
2200 

I 2000 
1800 

Q. 

Ol 1600 
c 

•| 1400 
H 1200 
1000 



- PGS 

- PBP 

■ CE-PBP 




100 200 300 400 500 600 
iterations 



2600 

2400 

2200 

J; 2000 

S 1800 
a 

Oi 1600 
c 

■I 1400 
H 1200 
1000 



KOS 



-PGS 
-PBP 
- CE-PBP 




20 40 60 



time(s) 



Figure 4: (a) Left: Perplexity as a function of iteration on KOS. (b) Right: Perplexity as a function 
of training time on KOS. 



Fig. [3] compares the overall communication cost of PGS, PBP and CE-PBP on four data sets in 
Table Q] On the four data sets, we find that the communication time of PBP is around twice that 
of PGS. This result follows our analysis that PBP uses double type while PGS uses integer type 
to store the global parameter matrix. The actual communication cost of CE-PBP is about 17% of 
PBP, slightly shorter than the expected value 20% according to (fTTT >. Such an improvement can be 
partly attributed to less input/output conflicts during updating the global parameter matrix. Since 
the access time of the global parameter is remarkably reduced, there are less access conflicts among 
all processors. 

Fig. 2k shows training perplexity as a function of iterations for PGS, PBP and CE-PBP on KOS. 
In the first iterations, CE-PBP converges slower than PBP, with the largest perplexity gap near 400. 
The gap quickly decreases with more iterations so that the training perplexity overlaps after 100 
iterations. The perplexity gap remains within 10 after 200 iterations for the accuracy drop within 
1%, acceptable in most applications. Fig. [4}} shows the training perplexity as a function of training 
time for PGS, PBP and CE-PBP on KOS. CE-PBP is almost twice faster than PBP and 20% faster 
than PGS. In addition, CE-PBP achieves almost the same training perplexity as PBP, which is much 
lower than that of PGS. 

Fig- Eh illustrates the speedup performance of PGS, PBP and CE-PBP on ENRON. The speedup is 
measured by Tq/(Tq/M + T c ), where M is the number of processors, To denotes the training time 
of GS or BP on a single processor, T c denotes communication cost. Fig. [5^ shows that CE-PBP 
exhibits much better speedup than PBP and PGS. Fig. [5J3 shows the corresponding computation 
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time and communication time. CE-PBP and PGS have almost the same computation time, but the 
former uses significantly smaller communication time than the latter. Fig. [5J; shows the computa- 
tion/communication ratio (CCR) for the parallel efficiency of CE-PBP. The CCR of CE-PBP is as 2 
to 3 times as that of PBP and PGS, reflecting a much better parallel topic modeling efficiency. 




Number of processors Number of processors Number of processors 

Figure 5: (a) Left: Speedup performance, (b) Middle: Computation time and communication time, 
(c) Right: Parallel efficiency measured by CCR. 

Recently, Google reports an improved version of PGS called PGS+ lfl2ll . which reduces communi- 
cation cost using four interdependent strategies including data placement, pipeline processing, word 
bundling and priority-based scheduling. The ratio of communication time of both PGS+ and CE- 
PBP to their original algorithms, PGS and PBP, should be a fair comparison. While a communication 
reduction ratio of 27.5% is reported by PGS+ (3.68 seconds and 13.38 seconds for PGS+ and PGS 
with the same settings), we achieve a much lower ratio of about 15%. Besides, CE-PBP has a lower 
predictive perplexity than PGS/PGS+, 

5 Conclusions 

To reduce the communication cost that severely affects scalability in parallel topic modeling, we 
have proposed CE-PBP that combines the parallel belief propagation (PBP) and a Zipf 's law solution 
for different communication rates. Extensive experiments on different data sets confirm that CE-PBP 
is faster, more accurate and efficient than the state-of-the-art PGS algorithm. Since many types of 
data studied in the physical and social sciences can be approximated by Zipf's law, our approach 
may provide a new way to accelerate other parallel algorithms. In future work, we shall study how 
to reduce the size K of the global parameter matrix 4>w x x in communication. Also, we plan to 
extend CE-PBP algorithm to learn more complicated topic models such as hierarchical Dirichlet 
process (HDP) [Q. 
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